Improved Document Representation for Classification Tasks for the Intelligence Community
Authors
Abstract
Research within a larger, multi-faceted risk assessment project for the Intelligence Community (IC) combines Natural Language Processing (NLP) and Machine Learning techniques to detect potentially malicious shifts in the semantic content of information either accessed or produced by insiders within an organization. Our hypothesis is that the use of fewer, more discriminative linguistic features can outperform the traditional bag-of-words (BOW) representation in classification tasks. Experiments using the standard Support Vector Machine algorithm and the LibSVM algorithm compared the BOW representation and two NLP representations. Classification results on NLP-based document representation vectors achieved greater precision and recall using forty-nine times fewer features than the BOW representation. The NLP-based representations improved classification performance by producing a lower dimensional but more linearly separable feature space that modeled the problem domain more accurately. Results demonstrate that document representation using sophisticated NLP-extracted features improved text classification effectiveness and efficiency with the SVM and LibSVM algorithms.
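The contrast between the two kinds of representation can be sketched in a few lines of Python. This is only a toy illustration: the documents, the `CONCEPTS` lexicon, and the helper names `bow_vector` and `concept_vector` are all hypothetical, and the lexicon is a simple stand-in for the NLP-extracted features, which the abstract does not describe in detail. It shows how mapping words to a small set of concept categories yields far shorter vectors than a bag-of-words vocabulary; either set of vectors could then be passed to an SVM.

```python
from collections import Counter

# Toy corpus invented for illustration; not data from the study.
DOCS = [
    "analyst reviewed satellite imagery of the northern port",
    "analyst requested reports on missile sites and troop movements",
    "employee downloaded payroll files and personal bank records",
    "employee searched for casino odds and offshore bank accounts",
]

# Hypothetical concept lexicon standing in for NLP-extracted features.
CONCEPTS = {
    "satellite": "GEOINT", "imagery": "GEOINT", "port": "GEOINT",
    "missile": "MILITARY", "troop": "MILITARY", "sites": "MILITARY",
    "payroll": "FINANCE", "bank": "FINANCE",
    "casino": "FINANCE", "offshore": "FINANCE",
}


def bow_vector(doc, vocab):
    """Count every vocabulary word in the document (one dimension per word)."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


def concept_vector(doc, concepts):
    """Count concept categories only (one dimension per category)."""
    counts = Counter(CONCEPTS[w] for w in doc.split() if w in CONCEPTS)
    return [counts[c] for c in concepts]


vocab = sorted({word for doc in DOCS for word in doc.split()})
concepts = sorted(set(CONCEPTS.values()))

bow = [bow_vector(doc, vocab) for doc in DOCS]
nlp = [concept_vector(doc, concepts) for doc in DOCS]

print("BOW dimensions:    ", len(vocab))
print("Concept dimensions:", len(concepts))
for doc, vec in zip(DOCS, nlp):
    print(vec, doc)
```

In this toy setup each document's concept vector is non-zero in a single dimension, which loosely mirrors the abstract's claim of a lower dimensional but more separable feature space; real NLP features would of course be extracted rather than looked up in a hand-written table.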
Similar resources
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
A New Document Embedding Method for News Classification
Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A great deal of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
Learning Document Image Features With SqueezeNet Convolutional Neural Network
The classification of various document images is considered an important step towards building a modern digital library or office automation system. Convolutional Neural Network (CNN) classifiers trained with backpropagation are considered to be the current state of the art model for this task. However, there are two major drawbacks for these classifiers: the huge computational power demand for...
Deep Unsupervised Domain Adaptation for Image Classification via Low Rank Representation Learning
Domain adaptation is a powerful technique given a wide amount of labeled data from similar attributes in different domains. In real-world applications, there is a huge amount of data, but most of it is unlabeled. It is effective in image classification where it is expensive and time-consuming to obtain adequate label data. We propose a novel method named DALRRL, which consists of deep ...
Evolving Better Stoplists for Document Clustering and Web Intelligence
Text classification, document clustering and similar document analysis tasks are currently the subject of significant global research, since such areas underpin web intelligence, web mining, search engine design, and so forth. A fundamental tool in such document analysis tasks is a list of so-called ‘stop’ words, called a ‘stoplist’. A stoplist is a specific collection of so-called ‘noise’ word...